Exploratory Data Analysis of Red Wine

Adam McCarthy

The red wine dataset consists of 1599 variants of the Portuguese “Vinho Verde” wine. It includes physicochemical (input) and sensory (output) variables. This dataset is publicly available for research. The details are described in [Cortez et al., 2009].

Univariate Plots Section

Overview: There are 13 variables within the original dataset. One of these is a dependent variable giving a subjective measure of quality based on experts' sensory review of the wine. The main 11 variables are independent physicochemical tests. These may be inter-related but are initially treated as individual measurements. The remaining variable, X, is a row index.

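Input variables (based on physicochemical tests): fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol.

Output variable (based on sensory data): quality (score between 0 and 10).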

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality      quality_c  
##  Min.   : 8.40   Min.   :3.000   lower:744  
##  1st Qu.: 9.50   1st Qu.:5.000   upper:855  
##  Median :10.20   Median :6.000              
##  Mean   :10.42   Mean   :5.636              
##  3rd Qu.:11.10   3rd Qu.:6.000              
##  Max.   :14.90   Max.   :8.000

The summary table depicts descriptive statistics for all of the variables.

Quality is scored on a scale of 0 to 10. However, the dataset is limited to values between 3 and 8.

One categorical variable, quality_c, is generated to group quality into two categories: lower (<=5) and upper (>5). This will be used to explore any major differences in values between the two groups.
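The split into two categories can be sketched in a few lines. This is a minimal illustration in Python rather than the code used to produce this report; the function name quality_category is hypothetical.

```python
def quality_category(score: int) -> str:
    """Map a 0-10 quality score to the two groups used in this report."""
    if not 0 <= score <= 10:
        raise ValueError(f"quality score out of range: {score}")
    # Scores of 5 or below form the lower group, 6 and above the upper group.
    return "lower" if score <= 5 else "upper"


# The dataset only contains scores 3 to 8.
observed_scores = [3, 4, 5, 6, 7, 8]
categories = {score: quality_category(score) for score in observed_scores}
print(categories)
```

With the threshold at 5, scores 3 to 5 fall into the lower group and scores 6 to 8 into the upper group.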

pH is on a scale of 0 to 14 (potential of hydrogen). All values here are acidic (<7), with the range occurring between 2.74 and 4.01.

Alcohol is measured as a percentage.

The other variables are all numeric data types.

There are no missing values within the dataset.

Quality is between 3 and 8 despite the scoring range being 0-10. The majority of values are 5 and 6. There is a skew towards the right tail, with more of the remaining scores at 7. Given the poor sampling of scores outside of 5 and 6, the quality measure will be split into two categories, upper (>5) and lower (<=5).

Alcohol also has a skewed distribution, with most values located between 9% and 13%. A small histogram can give a quick impression of the distribution of data more effectively than reading the descriptive statistics.

To investigate further, 9 variables are placed into histograms to understand their distributions. Of these, pH, fixed acidity and volatile acidity have approximately normal distributions; these could be investigated further using quantile-quantile plots.

Residual sugar and chlorides both appear to have highly skewed, overdispersed data. These could be explored using a transformation to investigate the distribution of values further.

Citric acid has two spikes, one occurring at 0 and one around 0.5. The bin size could be changed to investigate whether there is some artifact in the data related to rounding.
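One way to check for a rounding artifact is to count the same values at two bin widths and see whether a spike survives. The sketch below is an assumption about how this could be done, not the method used for the histograms in this report; the helper name bin_counts and the sample values are hypothetical.

```python
import math
from collections import Counter


def bin_counts(values, width):
    """Count values per histogram bin; keys are integer bin indices."""
    counts = Counter()
    for value in values:
        # Round the quotient first so that floating point noise
        # (e.g. 0.3 / 0.1 = 2.9999...) does not push a value into the wrong bin.
        index = math.floor(round(value / width, 9))
        counts[index] += 1
    return dict(counts)


# Illustrative values only, not taken from the dataset.
citric_acid = [0.0, 0.0, 0.02, 0.49, 0.49, 0.5]

fine = bin_counts(citric_acid, 0.01)   # bins of width 0.01
coarse = bin_counts(citric_acid, 0.1)  # bins of width 0.1
print(fine)
print(coarse)
```

At the fine width the repeated value 0.49 stands out on its own, while at the coarse width it is merged with its neighbours, which is the kind of difference a change in bin size would reveal.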

Density appears relatively normal but with occasional spikes; these should be checked for rounding issues.

Free sulfur dioxide and total sulfur dioxide have skewed distributions, with a sparse population of values towards the maximum.

Sulphates also has overdispersed values to the right, with a skewed distribution. The rug plot helps highlight the low occurrence of these values in two clusters, close to 2 and close to 1.6.

Using a quantile-quantile plot, pH can be examined in greater detail and compared to an idealised normal distribution. pH appears to have light tails, as displayed by the overall sigmoidal shape of the points. It displays very fine-grained clustering, which is related to rounding of values, and it shows an overall skew to the right.

Towards the upper right of the quantile-quantile plot against a standard normal distribution, the points fall outside the 95% confidence intervals.

Density also shows light tails and skew to the right. The fit for a normal distribution is outside of the 95% confidence intervals.
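The points of a normal quantile-quantile plot can be computed by pairing each ordered observation with the matching quantile of a normal distribution. The sketch below is a simplified illustration: it compares against a normal fitted to the sample mean and standard deviation rather than a standard normal, it does not compute confidence bands, and it uses the five-number summary of pH from the summary table as a stand-in for the full column. The function name normal_qq_points is hypothetical.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(sample):
    """Pair each ordered observation with the matching quantile of a fitted normal."""
    ordered = sorted(sample)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i + 0.5) / n keep the probabilities strictly inside (0, 1).
    return [(fitted.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(ordered)]


# Five-number summary of pH from the summary table, used as a stand-in sample.
ph_summary = [2.74, 3.21, 3.31, 3.40, 4.01]
points = normal_qq_points(ph_summary)
for theoretical, observed in points:
    print(f"{theoretical:.3f}  {observed:.2f}")
```

Points that lie close to the 1-1 line indicate agreement with the normal distribution; systematic departures at either end indicate tails that are lighter or heavier than normal.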

To improve visualisation of overdispersed variables, data transformations can be applied. The three plots above show residual sugar on its original scale, followed by square root and log10 transformations. The transformations give a better representation of the distribution of values, although in this case they still have long tails towards the right.
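The effect of the two transformations can also be expressed as a number by comparing skewness before and after. The sketch below uses the moment coefficient of skewness and, as an assumption, only the five-number summary of residual sugar from the summary table instead of the full column, so the values are indicative at best.

```python
import math


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (positive means a right tail)."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((v - centre) ** 2 for v in values) / n
    m3 = sum((v - centre) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Five-number summary of residual sugar from the summary table, not the full data.
residual_sugar = [0.9, 1.9, 2.2, 2.6, 15.5]

skew_raw = skewness(residual_sugar)
skew_sqrt = skewness([math.sqrt(v) for v in residual_sugar])
skew_log10 = skewness([math.log10(v) for v in residual_sugar])
print(skew_raw, skew_sqrt, skew_log10)
```

On these five values the skewness falls from the raw scale to the square root scale and again to the log10 scale, but stays positive, in line with the long right tail that remains after transformation.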

Other overdispersed variables are explored in the above plot, and each is matched to a data transformation that better represents the distribution of data values. Chlorides shows a roughly normal distribution with a long tail to the right; it is centred nicely after a log10 transformation.

Citric acid still has a large count of values close to 0. The distribution is best represented using a square root transformation; this does not bring the data closer to a normal distribution but better represents the distribution of values across the range.

Total sulfur dioxide is transformed using log10, which gives a wide, roughly normal distribution.

Free sulfur dioxide puts too many values on the right-hand side under a log10 transformation, so even with a skew, square root gives a better distribution of values.

Sulphates gives a better distribution of values after transformation but is still skewed.

Each of these transformations can be considered when comparing these variables to other variables in further stages of the analysis.

Univariate Analysis

What is the structure of your dataset?

The dataset is in a tidy format. Each observation corresponds to a series of variables.

The dataset consists of one key dependent variable, quality. This is based on the subjective sensory assessment of experts. This should result in a value between 0 and 10; within this dataset the majority of the samples have scores of 5 or 6.

The other eleven variables are all measurements. It is assumed that these are independent, although there may be relationships between some of these eleven variables.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is the dependent variable, quality.

Without further exploratory work it cannot be assessed which of the 11 measurement variables is the most important.

What other features in the dataset do you think will help support your
investigation?

At the moment all other features are of interest as this investigation is not applying any domain knowledge about what is most important.

Did you create any new variables from existing variables in the dataset?

One categorical variable is created based on the dependent quality variable. This splits quality into two categories, upper and lower. The purpose of this is to represent better (upper) and worse (lower) wines. Quality is split in two due to the limited number of values outside of 5 and 6. The idea is to use this new variable to see if there are any relationships in other variables that separate better and worse wines.

Of the features you investigated, were there any unusual distributions?

There appear to be no normal distributions in this dataset. Two (pH and density) are close, but the Q-Q plots have shown they sit outside of an idealised normal distribution.

Rounding of inputs appears to cause minor clustering within the dataset. This may relate to the precision of the tools used to assess the physicochemical properties of each wine.

Citric acid has a large number of values close to zero. This could be further investigated to see if this is a data quality issue or true signal.

Several variables show overdispersed distributions with long tails to the right. Using either square root or log10 transformations can help better represent the distribution of these values. Each of these has been investigated to identify which transformation should be applied for future plots.

Bivariate Plots Section

This section will begin by assessing whether there are any variables with a strong difference between upper and lower wine quality reviews.

This section will also check if any of the measurement variables are related.

Frequency polygons show the distribution of values, now split by the two groups of the dependent variable quality. They can help show whether there are trends between any of the variables and the output variable.

In the above figure, alcohol shows that a large count of wines with lower quality measurements is associated with lower alcohol values.

Sulphates also displays a difference between the upper and lower categories, in which higher values of sulphates appear to exclude the lower quality scores.

By plotting the remaining variables in a similar way it is possible to quickly identify any variables that appear to have a stronger relationship to the upper or lower quality group. The first observation is that many of the variables show little difference between the frequency polygons for upper and lower. Some do show a difference, but it does not appear to be large or very obvious at first inspection.

Volatile acidity shows a distribution centred to the left for upper. The frequency polygon shows lower values for the upper category than for the lower category.

Total sulfur dioxide has none of its highest values in the upper group.

Both free sulfur dioxide and total sulfur dioxide show two peaks for the upper group (this may relate to count, as this is not a density plot).

Citric acid has a spike in its higher values for the upper group.

These issues will be worth investigating further.
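Because the two quality groups differ in size (744 lower, 855 upper in the summary table), raw counts can exaggerate or hide differences between the polygons. A possible follow-up is to rescale counts to densities so that each polygon has an area of 1. In the sketch below the per-bin counts and the bin width are hypothetical; only the group totals come from the summary table.

```python
def to_density(bin_counts, bin_width):
    """Convert per-bin counts to densities so that the area under the polygon is 1."""
    total = sum(bin_counts)
    return [count / (total * bin_width) for count in bin_counts]


# Hypothetical bin counts for one variable, split by quality group.
# Only the group sizes (744 lower, 855 upper) come from the summary table.
lower_counts = [100, 300, 244, 100]
upper_counts = [55, 200, 400, 200]
bin_width = 0.5

lower_density = to_density(lower_counts, bin_width)
upper_density = to_density(upper_counts, bin_width)
print(lower_density)
print(upper_density)
```

After rescaling, the heights of the two polygons can be compared directly, regardless of how many wines fall into each group.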

The pair plot gives all combinations of bivariate analysis. Its main limitation is that it is not created using the data transformations previously identified. It acts as a useful way to look up pairs of variables to investigate further.

As previously shown by the bivariate frequency polygon plots, very few of the variables differ significantly between the upper and lower categories of quality, apart from alcohol. This can be viewed in the box plots and paired histograms. Alcohol has a 0.48 correlation with quality on this plot. Volatile acidity has -0.39 and sulphates has 0.25. These are the variables with the highest correlation to the quality variable.
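The correlation values quoted from the pair plot are Pearson correlation coefficients. As a reminder of what that number measures, the sketch below computes it from first principles. The alcohol and quality values are hypothetical and chosen only for illustration; they do not reproduce the 0.48 reported above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for illustration only, not rows from the dataset.
alcohol = [9.4, 9.8, 10.5, 11.2, 12.0]
quality = [5, 5, 6, 6, 7]
r = pearson(alcohol, quality)
print(round(r, 2))
```

The coefficient ranges from -1 to 1 and only captures linear association, which is one reason the skewed variables are worth re-checking on a transformed scale.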

Free sulfur dioxide and total sulfur dioxide have a correlation of 0.67, but the two variables have little correlation with other variables.

Density appears to have some weak correlation with a few variables like residual sugar and fixed acidity. This could be a candidate for multi-variate analysis.

Chlorides and residual sugars should be checked as these are both highly dispersed variables so it is difficult to see any correlation in this pair plot.

## `geom_smooth()` using method = 'gam'

Total vs. free sulfur dioxide is plotted with both variables on a log10 scale to highlight the correlation (0.67) between the two variables. There are a handful of outliers, but this appears to show a relationship.

Density plotted against fixed acidity shows a relationship between the two variables. This has been fitted with a linear model.
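A linear model of this kind is an ordinary least squares fit of density on fixed acidity. The sketch below shows the calculation on a few hypothetical pairs of values; the numbers are invented for illustration and the function name fit_line is an assumption, not part of the original analysis.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical (fixed acidity, density) pairs, for illustration only.
fixed_acidity = [6.0, 7.0, 8.0, 9.0, 10.0]
density = [0.9950, 0.9960, 0.9968, 0.9975, 0.9985]

intercept, slope = fit_line(fixed_acidity, density)
# Residuals show how far each point sits from the fitted line.
residuals = [y - (intercept + slope * x) for x, y in zip(fixed_acidity, density)]
print(intercept, slope)
```

Plotting the residuals against the fitted values is a quick way to judge whether a straight line is an adequate description of the relationship.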

Density compared to alcohol shows a negative relationship, but it does not fit well using a linear model. There is a cluster of values with low alcohol. It shows a good spread of values.

Chlorides and residual sugar are compared with a log10 transform but there still appears to be limited correlation between these two variables.

The comparison of sulphates to volatile acidity shows a weak negative relationship. It shows a good dispersion of values when both axes are on a log10 scale. A linear model does not fit well to this plot.

Fixed acidity compared to volatile acidity shows a negative correlation.

Citric vs fixed acidity shows a positive relationship.

Comparing sulphates to quality values gives an initial appearance of a correlation. In the following series of plots, categories 5 and 6 are always key as they have a much higher proportion of values. Categories 5 and 6 have a wider distribution of values than the other categories. Despite the apparent trend between high quality and low quality values, this may not be signal given the low number of samples in the highest and lowest scores. Still, there is a clear difference between the sulphate values of category 3 and those of category 8.

Linear models are not used in these plots as they will always be swayed by the high number of values within quality values 5 and 6.

Volatile acidity is another variable with a correlation to the dependent variable, quality. The plot shows each category of quality on the y axis, with a scatter of points showing the distribution of volatile acidity. There is a weak correlation, but what is more interesting is which values do not occur in certain classes. For example, there are no values above 0.9 for the top-scored wines. For both the high and low quality scores the challenge is the low number of samples.

Alcohol compared to quality shows a positive correlation. The values are predominantly within the 5 and 6 scores. The lowest scoring wines often have lower alcohol values, while the highest scores tend towards the higher values. A key observation is category 6, which shows a wide range of alcohol values. This calls into question the strength of the correlation between higher alcohol and higher wine quality.

Comparing two variables using a quantile quantile plot (Empirical QQ plot) can give information about how different the variables are. In this case alcohol is split between upper quality values (>5) and lower quality values (<=5).

The plot and confidence intervals suggest that the difference between the two groups of alcohol samples is not significant, as the confidence bands and values intersect the 1-1 line.
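The points of an empirical Q-Q plot are obtained by computing the same quantiles in both groups and pairing them. The sketch below shows only that pairing step, using hypothetical alcohol values; it does not compute the 95% confidence bands referred to above, so it cannot on its own support a statement about significance.

```python
from statistics import quantiles


def empirical_qq(sample_a, sample_b, n=10):
    """Pair the n-quantile cut points of two samples (n - 1 points)."""
    qa = quantiles(sample_a, n=n, method="inclusive")
    qb = quantiles(sample_b, n=n, method="inclusive")
    return list(zip(qa, qb))


# Hypothetical alcohol values (% vol) for the two quality groups.
lower = [9.0, 9.2, 9.4, 9.5, 9.8, 10.0, 10.4, 11.0]
upper = [9.5, 9.9, 10.2, 10.8, 11.0, 11.4, 12.0, 12.8]

pairs = empirical_qq(lower, upper, n=4)
# Positive differences mean the upper group sits above the 1-1 line.
differences = [b - a for a, b in pairs]
print(pairs)
```

If both groups had the same distribution the pairs would fall on the 1-1 line; a consistent offset in one direction indicates a shift between the groups.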

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Overall there are few strong correlations between variables in this dataset. The most promising line of investigation seems to be alcohol and its relationship to quality.

Two other variables (Volatile acidity and sulphates) have a weak relationship to the feature of interest.

All have been investigated. The challenge relates to the low proportion of samples in the highest and lowest scores compared to the quality scores of 5 and 6. Often scores of 5 and 6 show a wider range of values in each independent variable, calling into question whether the relationships between higher and lower quality scores are signal or noise.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Few strong relationships exist. Density appears to have associations with multiple variables and can be investigated further through multivariate analysis.

A number of plots have outliers. It would be interesting to know if there are clusters within the dataset or any other types of structure that cannot be observed through bivariate analysis.

What was the strongest relationship you found?

Total and free sulfur dioxide have the strongest relationship found. This would make sense if free sulfur dioxide is proportional to the total amount of sulfur dioxide in a sample.

Density to fixed acidity has the same correlation value in the pair plot as total and free sulfur dioxide.

Multivariate Plots Section

By using the ratio of pairs of correlated independent variables more dimensions of the dataset can be viewed at the same time. These paired relationships are based on the bivariate analysis.

Based on bivariate exploration of variables, a number of candidates exist to explore as ratios. Ratios allow for combining two variables with a relationship. This allows for more multivariate exploration: each plot combines two pairs of variables to see if the quality of wine stands out in some sort of pattern or relationship.

The plot is double encoded for each quality category with both a colour and a different symbol. This helps emphasise the rarer values of 3, 4, 7 and 8, which can be obscured when only using colour.

The upper left plot uses the relationship between total and free sulfur dioxide and the relationship between density and fixed acidity. Overall there are weak suggestions that this is aiding in separating quality values. There are some weak clusters (e.g. value 8) between bottom left and top right. This likely reflects that sulfur dioxide does not have a considerable impact on the quality value.

The top right plot uses sulphates and volatile acidity transformed to log10 on the y axis with density and alcohol on the x axis. Both pairs of variables have some correlation and each of these variables has been previously noted as being correlated to quality. This plot does show a relationship to quality values, with higher quality values being associated with a low density/alcohol ratio and a positive sulphates/volatile acidity ratio. This is a noisy relationship with a lot of overlap, especially from values 5 and 6, which are the most populated.

The lower left plot is similar to the last plot described but swaps the density/alcohol ratio for density/fixed acidity. This shows a similar relationship to the previous plot but not as strongly along the x axis.

The lower right plot uses a ratio of sulphates and chlorides with a log10 transformation on the y axis and a ratio of pH and alcohol on the x axis. This also shows a relationship, where a high ratio of sulphates to chlorides along with a lower ratio of pH to alcohol suggests higher quality wines.

Based on multivariate exploration two pairs of ratios are selected and plotted with upper and lower quality categories as colour and quality as the size in this bubble plot. The size of the shape is subtle due to the dominance of values 5 and 6. It helps highlight the lowest scoring values.

This plot shows there is a relationship within the data that can begin to separate some of the lower and upper quality wines. It also shows there is significant overlap.

The last plot uses a bubble chart separated by each factor of quality. The axes are based on the same pairs of ratios as previously used to highlight differences between upper and lower quality wines.

The ratio of fixed acidity to citric acid controls the size but shows no immediate patterns within this plot, other than some lower values in quality values 3, 4 and 8.

The main purpose of this plot is to again show the challenges of the number of samples outside of values 5 and 6. Having a more even sample of values in the higher or lower scores would increase the confidence that the relationships previously described are reasonable and can be pursued further.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

To assess the dependent variable quality, the most productive investigation was using ratios of correlated variables.

The chosen pair of density/alcohol and sulphates/volatile acidity (using a log10 transform) proved most effective at separating upper and lower quality wines. Neither of these pairs has the strongest correlation, but both pairs showed a good distribution of values, which may be why they work well as ratios to describe variations in the dataset.
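The two ratios can be derived from a single wine record as sketched below. As an assumption for illustration, the record is built from the column means in the summary table rather than from a real row, and the function and key names are hypothetical.

```python
import math


def ratio_features(row):
    """Build the two ratios used as plot axes from one wine record."""
    return {
        "density_alcohol": row["density"] / row["alcohol"],
        # log10 of the ratio, matching the log10 scale used on the y axis.
        "log10_sulphates_volatile": math.log10(
            row["sulphates"] / row["volatile.acidity"]
        ),
    }


# Column means from the summary table, used here as a single stand-in record.
mean_wine = {
    "density": 0.9967,
    "alcohol": 10.42,
    "sulphates": 0.6581,
    "volatile.acidity": 0.5278,
}
features = ratio_features(mean_wine)
print(features)
```

Because density varies very little between wines, the density/alcohol ratio is driven almost entirely by alcohol, which is worth keeping in mind when interpreting the x axis.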

Were there any interesting or surprising interactions between features?

Despite selecting four variables to highlight the relationship, different combinations of variables each gave different insights into the dataset.

It was surprising to see the relationship using upper and lower quality, considering that bivariate and univariate analysis had so far only shown weak relationships.

Principal component analysis could be used to expand this investigation.

Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

No models have been created due to ambiguity between the dependent and independent variables. Machine learning could create a model based on the observations.

Getting a better sample across quality values should be prioritised before trying sophisticated modelling in this case.


Final Plots and Summary

Plot One

Description One

The dependent variable within this dataset is quality, measured between 0 and 10. The majority of samples have scores of 5 or 6. The extreme scores that are present (3 and 8) are very poorly sampled, and scores of 0 to 2, 9 and 10 do not occur in the dataset. This makes it challenging to identify robust relationships between independent variables. This plot highlights that there is no information outside of the range 3-8.
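The coverage of the 0-10 scale can be summarised with a simple frequency count. The sketch below uses a hypothetical handful of scores for illustration, not the 1599 values in the dataset, so the printed share is not the real proportion.

```python
from collections import Counter

# Hypothetical handful of quality scores, for illustration only.
scores = [5, 6, 5, 7, 3, 6, 8, 4, 5, 6]

counts = Counter(scores)
# Scores on the 0-10 scale that never occur in the sample.
missing = [s for s in range(11) if counts[s] == 0]
# Share of observations that fall on the two dominant scores.
share_5_or_6 = (counts[5] + counts[6]) / len(scores)
print(missing, share_5_or_6)
```

Applied to the full quality column, the same count makes the imbalance between the middle scores and the extremes explicit.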

Alcohol is one of the variables which have an apparent relationship with quality. At first look there appears to be a trend suggesting lower alcohol, lower quality and vice versa.

Plot Two

Description Two

Comparing two variables using a quantile-quantile plot (empirical Q-Q plot) can give information about whether the difference between them is significant. In this case alcohol is split between upper quality values (>5) and lower quality values (<=5). This is chosen because alcohol has the strongest correlation to quality (0.48) and because of the observation made when comparing frequency polygons between these two groups.

The steps in the plot relate to rounding of measurements. The majority of the values between 10% and 12.5% are separated, with higher values in the upper quality group. The 95% confidence bands highlight that this is not a significant observation, as the two tails intersect the 1-1 line.

This plot is chosen to highlight that the independent variable with the strongest correlation does not show a significant difference between upper and lower quality groups.

Plot Three

Description Three

Based on multivariate exploration of many combinations of pairs of ratios, two are selected to highlight trends in the data. These are plotted with upper and lower quality categories as colour and quality as the size in this bubble plot. The size of the shape is subtle due to the dominance of values 5 and 6. It helps highlight the lowest scoring values.

The two ratios used are density/alcohol along the x axis and sulphates/volatile acidity with a log10 transform along the y axis. There is a weak relationship within each of these pairs, and all four variables have been seen to have a weak to moderate impact on the dependent quality variable.

This plot shows there is a relationship within the data that can begin to separate some of the lower and upper quality wines. Machine learning approaches could fit a model to separate these groups of values. It also shows there is significant overlap, which will hamper the performance of any model created. Expectations should not be too high about obtaining a high quality predictive model from this dataset.

Reflection

Systematically working through all of the variables in univariate and bivariate analysis allowed the multivariate analysis to be steered in a way that has shown a relationship to the dependent variable.

This was much more effective than just playing with advanced multivariate graphs to plot many variables, which is often a temptation.

The struggle with this dataset was not having any domain knowledge about the topic, and then the lack of correlated variables. Perhaps there are ways to better identify a relationship between quality and other variables with more knowledge about the measurement types; however, I feel this analysis has explored most of the potential combinations of variables.

The other key issue was the low number of samples with quality outside of 5 and 6. This dataset is a good case where gathering a broader sample of observations should be prioritised before committing time to other methodologies like machine learning.

Dataset from:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236. Available at: [@Elsevier](http://dx.doi.org/10.1016/j.dss.2009.05.016)